Abstract

In active learning, the user sequentially chooses values for feature $X$ and an oracle returns the corresponding label $Y$. In this paper, we consider the effect of feature noise in active learning, which could arise either because $X$ itself is being measured, or it is corrupted in transmission to the oracle, or the oracle returns the label of a noisy version of the query point. In statistics, feature noise is known as “errors in variables” and has been studied extensively in non-active settings. However, the effect of feature noise in active learning has not been studied before. We consider the well-known Berkson errors-in-variables model with additive uniform noise of width $\sigma$.

Our simple but revealing setting is that of one-dimensional binary classification, where the goal is to learn a threshold (the point where the probability of a $+$ label crosses half). We deal with regression functions that are antisymmetric in a region of size $\sigma$ around the threshold and also satisfy Tsybakov’s margin condition around the threshold. We prove minimax lower and upper bounds which demonstrate that when $\sigma$ is smaller than the minimax active/passive noiseless error derived in Castro & Nowak (2007), then noise has no effect on the rates and one achieves the same noiseless rates. For larger $\sigma$, the unflattening of the regression function on convolution with uniform noise, along with its local antisymmetry around the threshold, together yield a behaviour where noise appears to be beneficial. Our key result is that active learning can buy significant improvement over a passive strategy even in the presence of feature noise.


1 Introduction 


Active learning is a machine learning paradigm where the algorithm interacts with a label-providing oracle in a feedback driven loop where past training data (features queried and corresponding labels) are used to guide the design of subsequent queries. Typically, the oracle is queried with an exact feature value and the oracle returns the label corresponding precisely to that feature value. However, in many scenarios, the feature value being queried can be noisy and it helps to analyze what would happen in such a setting. Such situations include noisy sensor measurements of features, corrupted transmission of data from source to storage, or just access to a limited noisy oracle.


The errors-in-variables model has been well studied in the statistical literature and its effect can be profound. In density estimation, Gaussian error causes the minimax rate to become logarithmic in sample size instead of polynomial, see Fan (1991). For results in passive regression, refer to Fan et al. (1993); Fuller (2009); Carroll et al. (2010), and for passive classification, see Loustau & Marteau (2012). However, classification has not been studied in the Berkson model introduced below. Also, deconvolution estimators require the noise Fourier transform to be bounded away from zero, ruling out uniform noise. Finally, to the best of our knowledge, feature noise has not been studied for active learning in any setting.


The classical errors in variables model has the graphical form $W \leftarrow X \rightarrow Y$, representing

\[
W = X + \delta, \qquad Y = m(X) + \epsilon .
\]

Here, the label $Y$ depends on the feature $X$ but we do not observe $X$; rather we observe the noisy feature $W$. The Berkson errors in variables model is

\[
X = W + \delta, \qquad Y = m(X) + \epsilon .
\]

The difference is that we start with an observed feature $W$ and then noise is added to determine $X$. Graphically, this model is $W \rightarrow X \rightarrow Y$.

In this paper, we focus on the Berkson error model since it intuitively makes more sense for active learning - it captures the idea that we request a label for feature $W$, but the oracle returns the label for $X$, which is a corrupted version generated from $W$, i.e. the noise occurs between the label request and the oracle output. We use uniform noise since it yields insightful behavior and also has not been addressed in the literature. We conjecture that qualitatively similar results hold for other symmetric error models.


1.1 Setup 


Threshold Classification. Let $\mathcal{X} = [-1, 1]$, $\mathcal{Y} = \{+, -\}$, and $f : \mathcal{X} \to \mathcal{Y}$ denote a classification rule. Assuming 0/1 loss, the risk of the classification rule $f$ is $R(f) = E[1\{f(X) \neq Y\}] = P(f(X) \neq Y)$. It is known that the Bayes optimal classifier, the best measurable classifier that minimizes the risk, $f^* = \arg\min_f R(f)$, has the following form

\[
f^*(x) =
\begin{cases}
+ & \text{if } m(x) \ge 1/2 , \\
- & \text{if } m(x) < 1/2 ,
\end{cases}
\]

where $m(x) = P(Y = + \mid X = x)$ is the unknown regression function. In what follows, we will consider the case where $f^*$ is a threshold classifier, i.e. there exists a unique $t \in [-1, 1]$ with $m(t) = 1/2$ such that $m(x) < 1/2$ if $x < t$, and $m(x) > 1/2$ if $x > t$.
Berkson Error Model. The model is:

1. User chooses $W$ and requests label.

2. Oracle receives a noisy version of $W$, namely $X = W + U$.

3. Oracle returns $Y$ where $P(Y = + \mid X = x) = m(x)$.

We take the noise to be uniform: $U \sim \mathrm{Unif}[-\sigma, \sigma]$, where the noise width $\sigma$ is known for simplicity.
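The three steps of the model can be simulated directly. The following is a minimal sketch, not the authors' code: the names `make_berkson_oracle` and `step_regression` are hypothetical, labels are encoded as +1/-1, and the regression function used is the jump function that the paper later uses for $k = 1$.

```python
import random


def make_berkson_oracle(m, sigma, seed=0):
    """Return oracle(w) that labels the corrupted point X = w + U (hypothetical helper)."""
    rng = random.Random(seed)

    def oracle(w):
        # Step 2: the oracle receives X = W + U with U ~ Unif[-sigma, sigma].
        x = w + rng.uniform(-sigma, sigma)
        # Step 3: it returns + with probability m(x), otherwise -.
        return +1 if rng.random() < m(x) else -1

    return oracle


def step_regression(t, c):
    """Regression function that jumps from 1/2 - c to 1/2 + c at the threshold t."""
    return lambda x: 0.5 + c if x >= t else 0.5 - c


# Step 1: the user picks the query point w; here w = 0.5 is queried 100 times.
oracle = make_berkson_oracle(step_regression(t=0.0, c=0.5), sigma=0.1)
labels = [oracle(0.5) for _ in range(100)]
```

With $c = 0.5$ the label is deterministic given $X$, so every query that stays more than $\sigma$ to the right of $t$ returns $+$; randomness in the label near $t$ then comes only from the feature noise $U$.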


Sampling Strategies. In passive sampling, assume that we are given a batch of $w_i \sim \mathrm{Unif}[-1, 1]$ and corresponding labels $y_i$ sampled independently of $\{w_j\}_{j \neq i}$ and $\{y_j\}_{j \neq i}$. In this case, a strategy $S$ is just an estimator $S_n : (\mathcal{W} \times \mathcal{Y})^n \to [-1, 1]$ that returns a guess $\hat t$ of the threshold $t$ on seeing $\{w_i, y_i\}_{i=1}^n$.

In active sampling we are allowed to sequentially choose $w_i = S_i(w_1, \ldots, w_{i-1}, y_1, \ldots, y_{i-1})$, where $S_i$ is a possibly random function of past queries and labels, where the randomness is independent of queries and labels. In this case, a strategy $S$ is a sequence of functions $S_i : (\mathcal{W} \times \mathcal{Y})^{i-1} \to [-1, 1]$ returning query points and an estimator $S_n : (\mathcal{W} \times \mathcal{Y})^n \to [-1, 1]$ that returns a guess $\hat t$ at the end.

Let $\mathcal{S}_n^P, \mathcal{S}_n^A$ be the set of all passive or active strategies (and estimators) with a total budget of $n$ labels.

To avoid the issue of noise resulting in a point outside the domain, we make a (Q)uerying 
assumption: 

(Q). Querying within $\sigma$ of the boundary is disallowed.


Loss Measure. Let $\hat t = \hat t(W_1^n, Y_1^n)$ denote an estimator of $t$ using $n$ samples from a passive or active strategy. Our task will be to estimate the location of $t$, where we measure accuracy of an estimator $\hat t$ by a loss function which is the point error $|\hat t - t|$.


Function Class. In the analysis of rates for classification (among others), it is common to use the Tsybakov Noise/Margin Condition (see Tsybakov (2004)), to characterize the behavior of $m(x)$ around the threshold $t$. Given constants $c, C$ with $C \ge c$, $k \ge 1$, and noise level $\sigma$, let $\mathcal{P}(c, C, k, \sigma)$ be the set of regression functions $m(x)$ that satisfy the following conditions (T,M,B) for some threshold $t$:

(T). $C|x - t|^{k-1} \ge |m(x) - 1/2| \ge c|x - t|^{k-1}$ whenever $|m(x) - 1/2| \le \epsilon_0$ for some constant $\epsilon_0$.

(M). $m(t + \delta) - 1/2 = 1/2 - m(t - \delta)$ for all $\delta \le \sigma$.

(B). $t$ is at least $\sigma$ away from the boundary.

On adding noise $U$, the point where $m \star U$ ($\star$ means convolution) crosses half may differ from $t$, the point where $m$ crosses half. However, the antisymmetry assumption (M) and boundary assumption (B) together imply that the two thresholds are the same. Getting rid of (M,B) seems substantially difficult.
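The role of (M) can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the two regression functions, the values $t = 0.2$ and $\sigma = 0.3$, and the helper names `convolve_uniform` and `crossing_point` are all hypothetical choices, and the convolution is approximated by a midpoint rule.

```python
def convolve_uniform(m, w, sigma, grid=2000):
    """Midpoint-rule approximation of F(w) = (1 / (2 sigma)) * integral of m over [w - sigma, w + sigma]."""
    step = 2.0 * sigma / grid
    return sum(m(w - sigma + (j + 0.5) * step) for j in range(grid)) / grid


def crossing_point(f, lo, hi, iters=60):
    """Bisection for the point where an increasing function f crosses 1/2."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


t, sigma = 0.2, 0.3


def antisymmetric(x):
    """Satisfies (M): m(t + d) - 1/2 = 1/2 - m(t - d)."""
    d = x - t
    return 0.5 + 0.4 * d * abs(d)


def asymmetric(x):
    """Violates (M): the function is steeper to the right of t than to the left."""
    d = x - t
    return 0.5 + (0.8 if d > 0 else 0.2) * d * abs(d)


same_t = crossing_point(lambda w: convolve_uniform(antisymmetric, w, sigma), t - sigma, t + sigma)
shifted_t = crossing_point(lambda w: convolve_uniform(asymmetric, w, sigma), t - sigma, t + sigma)
```

In this example the antisymmetric function keeps its crossing point at $t$ after convolution, whereas the asymmetric one has its convolved crossing point pulled to the left of $t$, which is the difficulty that arises without (M).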


When $\sigma = 0$, (Q), (M) and (B) are vacuously satisfied, and this is exactly the class of functions and strategies considered in Castro & Nowak (2007). Smaller $k$ means that the regression function is steeper, which makes it easier to estimate the threshold and classify future labels (cf. Steinwart & Scovel (2004)). $k = 1$ captures a discontinuous $m(x)$ jumping at $t$.


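Minimax Risk. We are interested in the minimax risk under the point error loss:

\[
\mathcal{R}_n(\mathcal{P}(c, C, k, \sigma)) = \inf_{S \in \mathcal{S}_n} \; \sup_{P \in \mathcal{P}(c, C, k, \sigma)} E|\hat t - t| \tag{1}
\]

where $\mathcal{S}_n$ is the set of strategies accessing $n$ samples. For brevity, $\mathcal{R}_n^P(k, \sigma)$ or $\mathcal{R}_n^A(k, \sigma)$ denotes risk for (P)assive/(A)ctive sampling strategies $\mathcal{S}_n^P, \mathcal{S}_n^A$.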

Notation. We analyse minimax point error rates in different regimes of $\sigma$ as a function of $n$ (or equivalently, for a given point error, we can analyse how the sample size $n$ depends on $\sigma$) and we write $\sigma_n$ for emphasis. In this paper, $f_n \prec g_n$ means $f_n / g_n \to 0$, $f_n \asymp g_n$ means $c_1 g_n \le f_n \le c_2 g_n$ where $c_1, c_2$ are constants, $f_n \preceq g_n$ means $f_n \prec g_n$ or $f_n \asymp g_n$, $f_n \succeq g_n$ means $g_n \preceq f_n$ and $f_n \succ g_n$ means $g_n \prec f_n$.


2 Main Result and Comparisons


The main result of this paper is as follows. 


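Theorem 1. Under the Berkson error model, when given $n$ labels sampled actively or passively with assumption (Q), and when the true underlying regression function lies in $\mathcal{P}(c, C, k, \sigma_n)$ for known $k, \sigma_n$, the minimax risk under the point error loss is:

\[
\text{1.}\quad \mathcal{R}_n^P(\mathcal{P}(k, \sigma_n)) \asymp
\begin{cases}
n^{-\frac{1}{2k-1}} & \text{if } \sigma_n \prec n^{-\frac{1}{2k-1}} , \\[4pt]
\dfrac{1}{\sqrt{n}} \, \sigma_n^{-(k - 3/2)} & \text{otherwise,}
\end{cases}
\]

\[
\text{2.}\quad \mathcal{R}_n^A(\mathcal{P}(k, \sigma_n)) \asymp
\begin{cases}
n^{-\frac{1}{2k-2}} & \text{if } \sigma_n \prec n^{-\frac{1}{2k-2}} , \\[4pt]
\dfrac{1}{\sqrt{n}} \, \sigma_n^{-(k - 2)} & \text{otherwise.}
\end{cases}
\]

When $k = 1$, $m(x)$ jumps at the threshold, and we interpret the quantity $n^{-\frac{1}{2k-2}}$ as being exponentially small, i.e. being smaller than $n^{-p}$ for any $p$. We also suppress logarithmic factors in $n, \sigma_n$. If the domain was $[-R, R]$, the corresponding passive rates are obtained by substituting $n$ by $n/R$, but active rates remain the same up to logarithmic factors in $R$.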
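The two rate formulas of Theorem 1 can be written as small functions, which makes it easy to compare regimes. This is a sketch only: constants and logarithmic factors are ignored, the function names are hypothetical, and for $k = 1$ the "exponentially small" noiseless active rate is represented by $e^{-n}$ purely as a stand-in.

```python
import math


def passive_rate(n, k, sigma):
    """Passive rate: n^{-1/(2k-1)} for small noise, n^{-1/2} sigma^{-(k-3/2)} otherwise."""
    noiseless = n ** (-1.0 / (2 * k - 1))
    if sigma < noiseless:
        return noiseless
    return sigma ** (-(k - 1.5)) / math.sqrt(n)


def active_rate(n, k, sigma):
    """Active rate: n^{-1/(2k-2)} for small noise, n^{-1/2} sigma^{-(k-2)} otherwise."""
    # For k = 1 the noiseless rate is read as exponentially small; exp(-n) is a stand-in.
    noiseless = math.exp(-n) if k == 1 else n ** (-1.0 / (2 * k - 2))
    if sigma < noiseless:
        return noiseless
    return sigma ** (-(k - 2)) / math.sqrt(n)
```

For instance, `active_rate(10**6, 3, 0.2)` is smaller than `passive_rate(10**6, 3, 0.2)`, in line with the claim that active learning retains an advantage under feature noise.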


Remark. In this paper, we focus on learning the threshold $t$. This is relevant because the threshold may be of intrinsic interest, and also of interest for prediction if, for example, future queries could be made with a different noise model or can be obtained (with some cost) noise-free. Similar results can be derived for 0/1-risk.


Zero Noise. When $\sigma = 0$, the assumptions (Q,B,M) are vacuously true, and our class $\mathcal{P}(c, C, k, 0)$ matches the class $\mathcal{P}(c, C, k)$ considered in Castro & Nowak (2007), and our rates for $\sigma = 0$, i.e. $n^{-\frac{1}{2k-1}}$ and $n^{-\frac{1}{2k-2}}$, are precisely the passive and active minimax point error rates in Castro & Nowak (2007).
Small Noise. When the noise is small, we get what we expect - the risk does not change with noise as long as the noise itself is smaller than the noiseless error. In other words, as long as the noise is smaller than the noiseless error rate of $n^{-\frac{1}{2k-1}}$ for passive learning, passive learners will not really be able to notice this tiny noise, and the minimax rate remains $n^{-\frac{1}{2k-1}}$. Similarly, as long as the noise is smaller than the noiseless error rate of $n^{-\frac{1}{2k-2}}$ for active learning, active learners will not really be able to notice this tiny noise, and the minimax rate remains $n^{-\frac{1}{2k-2}}$. Also, the passive rates vary smoothly - at the point when $\sigma_n \asymp n^{-\frac{1}{2k-1}}$, the rates for small and large noise coincide. Similarly, at the point when $\sigma_n \asymp n^{-\frac{1}{2k-2}}$, the aforementioned active rates for small and large noise coincide.
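That the two regimes meet at the crossover can be checked by direct substitution into the large-noise rates; the short computation below is added for convenience and ignores constants and logarithmic factors.

```latex
\begin{align*}
\text{Passive, } \sigma_n = n^{-\frac{1}{2k-1}}: \quad
\frac{1}{\sqrt{n}}\,\sigma_n^{-(k-3/2)}
  &= n^{-\frac{1}{2} + \frac{k - 3/2}{2k-1}}
   = n^{\frac{-(2k-1) + (2k-3)}{2(2k-1)}}
   = n^{-\frac{1}{2k-1}} , \\
\text{Active, } \sigma_n = n^{-\frac{1}{2k-2}}: \quad
\frac{1}{\sqrt{n}}\,\sigma_n^{-(k-2)}
  &= n^{-\frac{1}{2} + \frac{k-2}{2k-2}}
   = n^{\frac{-(k-1) + (k-2)}{2k-2}}
   = n^{-\frac{1}{2k-2}} .
\end{align*}
```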


Large Noise and Assumption (M). When the noise is large, we see a curious behaviour 
of the rates. When k > 2, the error rates seem to get smaller/better with larger noise for both 
active and passive learning, and furthermore the noisy rates can also be better than the noiseless 
rate! This might seem to violate both the information processing inequality, and our intuition 
that more noise shouldn’t help estimation. Moreover, a noiseless active learner may be able to 
simulate a noisy situation by adding noise and querying at the resulting point, and get better 
rates, violating lower bounds in Castro & Nowak (2007). 


However, we make the following crucial but subtle observation. Our claimed rates are not about a fixed function class - due to assumption (M), the function class changes with $\sigma$, and in fact (M) requires the antisymmetry of the regression function to hold over a larger region for larger $\sigma$. This set of functions is actually getting smaller with larger $\sigma$. Even though the functions can behave quite arbitrarily outside $(t - \sigma, t + \sigma)$, this assumption (M) on a small region of size $2\sigma$ actually helps us significantly.


Given that there is no contradiction to the results of Castro & Nowak (2007) or more fundamental information theoretic ideas, there is also an intuitive explanation of why assumption (M) helps when we have large noise. As we will see in a later figure, convolution with noise seems to “stretch/unflatten” the function around the threshold. Specifically, for larger $k > 2$, the regression function can be quite flat around the threshold - convolution with noise makes it less flat and more linear - in fact it behaves linearly over a large region of width nearly $2\sigma$. This is true regardless of whether assumption (M) holds - however if (M) does not hold, then the convolved threshold, which is the point where the convolved function crosses half, need not be the original threshold $t$. Dropping assumption (M) will not hurt if we only want to find the convolved threshold, but given that our aim is to estimate $t$, the problem of figuring out how much the threshold shifted can be quite non-trivial.


Hence, large noise ensures a behaviour that is less flat and more linear around the threshold, and assumption (M) ensures that the threshold doesn’t shift from $t$. Intuitively this is why (M) and large noise help, and technically there is no contradiction because the function class is getting progressively simpler because of more controlled growth around the threshold.
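The unflattening effect can be seen in a small numerical experiment. The sketch below assumes a hypothetical regression function with $|m(x) - 1/2| = c|x - t|^{k-1}$ and $k = 4$, $c = 1$, $t = 0.5$, $\sigma = 0.2$; the helper names are illustrative and the convolution is approximated by a midpoint rule. Since the convolved function has derivative $(m(w + \sigma) - m(w - \sigma)) / (2\sigma)$, its slope at $t$ should be close to $c\,\sigma^{k-2}$, while the slope of $m$ itself at $t$ is zero.

```python
def convolve_uniform(m, w, sigma, grid=4000):
    """Midpoint-rule approximation of the convolved regression function F(w)."""
    step = 2.0 * sigma / grid
    return sum(m(w - sigma + (j + 0.5) * step) for j in range(grid)) / grid


def slope_at(f, x, eps=1e-3):
    """Central finite-difference estimate of the slope of f at x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


k, c, t, sigma = 4, 1.0, 0.5, 0.2


def m(x):
    """Antisymmetric regression function, flat at t when k > 2."""
    d = x - t
    return 0.5 + c * d * abs(d) ** (k - 2)


slope_before = slope_at(m, t)                                       # nearly 0
slope_after = slope_at(lambda w: convolve_uniform(m, w, sigma), t)  # about c * sigma^(k-2)
value_at_t = convolve_uniform(m, t, sigma)                          # stays at 1/2 under (M)
```

The convolved function still passes through $1/2$ at $t$ but is no longer flat there, which is what makes the threshold easier to locate.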

The main takeaway is that in all settings, active learning yields a gain over passive sampling. We now describe the upper and lower bounds that lead to Theorem 1. The case $k = 1$ is handled in detail for intuition, but proofs for $k > 1$ are in the Appendix.
Figure 1: Regression function $\eta(x)$ (red) before and $F(w)$ (blue) after convolution with noise. In all 3 figures, Tsybakov’s margin condition holds for $x \in [0.4, 0.6]$. The top plot has a linear regression function ($k = 2$), and its two blue curves are for $\sigma_n = 0.05$ (narrow), $0.2$ (wide), and they show that a linear growth around $t = 0.5$ remains linear. The middle and bottom figures are for a flatter regression function with $k = 4$, and $\sigma_n = 0.05, 0.2$ respectively, plotted separately for clarity. $k = 4$ is harder than $k = 2$ because the red curve is flatter around $t$, making it harder to pinpoint the threshold. However, as one can see in both plots, noise actually helps by smoothing it out and making it more linear. Note that the effect of assumption (M) cannot be overstated: because of it, in all plots the regression functions before and after noise cross half at the same point. The effect of noise when $k = 1$ can be seen in the following section.

2.1 Simulation of Noise Convolution 

2.2 Paper Roadmap 

We devote the next two sections to proving the lower and upper bounds, in that order, that lead to Theorem 1. While the proofs will be self-contained, we leave some detailed calculations to the appendix.

For easier readability, we present lower bounds for $k = 1$ first to absorb the technique and then the lower bounds for $k > 1$. In Section 3 we will prove

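Theorem 2 (Lower Bounds). Under the Berkson error model and assumption (Q),

1. For $k = 1$, the passive/active lower bounds are

\[
\inf_{S \in \mathcal{S}_n^P} \; \sup_{P \in \mathcal{P}(1, \sigma_n)} E|\hat t - t| \succeq
\begin{cases}
n^{-1} & \text{if } \sigma_n \preceq n^{-1} , \\[2pt]
\sqrt{\sigma_n / n} & \text{otherwise,}
\end{cases}
\]

\[
\inf_{S \in \mathcal{S}_n^A} \; \sup_{P \in \mathcal{P}(1, \sigma_n)} E|\hat t - t| \succeq
\begin{cases}
e^{-n} & \text{if } \sigma_n \preceq e^{-n} , \\[2pt]
\sigma_n / \sqrt{n} & \text{otherwise.}
\end{cases}
\]

2. For $k > 1$, the passive/active lower bounds are

\[
\inf_{S \in \mathcal{S}_n^P} \; \sup_{P \in \mathcal{P}(k, \sigma_n)} E|\hat t - t| \succeq
\begin{cases}
n^{-\frac{1}{2k-1}} & \text{if } \sigma_n \prec n^{-\frac{1}{2k-1}} , \\[4pt]
\dfrac{1}{\sqrt{n}} \, \sigma_n^{-(k - 3/2)} & \text{otherwise,}
\end{cases}
\]

\[
\inf_{S \in \mathcal{S}_n^A} \; \sup_{P \in \mathcal{P}(k, \sigma_n)} E|\hat t - t| \succeq
\begin{cases}
n^{-\frac{1}{2k-2}} & \text{if } \sigma_n \prec n^{-\frac{1}{2k-2}} , \\[4pt]
\dfrac{1}{\sqrt{n}} \, \sigma_n^{-(k - 2)} & \text{otherwise.}
\end{cases}
\]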


Following that, we again present active and passive algorithms for $k = 1$ first to gather intuition and then generalize them for $k > 1$. In Section 4 we will prove


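Theorem 3 (Upper Bounds). Under the Berkson error model and assumption (Q),

1. For $k = 1$, a passive algorithm (WIDEHIST) and an active algorithm (ACTPASS) return $\hat t$ s.t.

\[
\text{(passive)}\quad \sup_{P \in \mathcal{P}(1, \sigma_n)} E|\hat t - t| \preceq
\begin{cases}
n^{-1} & \text{if } \sigma_n \preceq n^{-1} , \\[2pt]
\sqrt{\sigma_n / n} & \text{otherwise,}
\end{cases}
\]

\[
\text{(active)}\quad \sup_{P \in \mathcal{P}(1, \sigma_n)} E|\hat t - t| \preceq
\begin{cases}
e^{-n} & \text{if } \sigma_n \preceq e^{-n} , \\[2pt]
\sigma_n / \sqrt{n} & \text{otherwise.}
\end{cases}
\]

2. For $k > 1$, a passive algorithm (WIDEHIST) and an active algorithm (ACTPASS) return $\hat t$ s.t.

\[
\text{(passive)}\quad \sup_{P \in \mathcal{P}(k, \sigma_n)} E|\hat t - t| \preceq
\begin{cases}
n^{-\frac{1}{2k-1}} & \text{if } \sigma_n \prec n^{-\frac{1}{2k-1}} , \\[4pt]
\dfrac{1}{\sqrt{n}} \, \sigma_n^{-(k - 3/2)} & \text{otherwise,}
\end{cases}
\]

\[
\text{(active)}\quad \sup_{P \in \mathcal{P}(k, \sigma_n)} E|\hat t - t| \preceq
\begin{cases}
n^{-\frac{1}{2k-2}} & \text{if } \sigma_n \prec n^{-\frac{1}{2k-2}} , \\[4pt]
\dfrac{1}{\sqrt{n}} \, \sigma_n^{-(k - 2)} & \text{otherwise.}
\end{cases}
\]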


3 Lower Bounds 


To derive lower bounds, we will follow the approach of Ibragimov & Hasminskii (1981); Tsybakov (2009), which were exemplified in lower bounds for active learning problems without feature noise in Castro & Nowak (2007, 2008). The standard methodology is to reduce the problem of classification in the class $\mathcal{P}(c, C, k, \sigma)$ to one of hypothesis testing. Similar to Castro & Nowak (2007, 2008), it will suffice to consider two hypotheses and use the following version of Fano’s lemma from Tsybakov (2009) (Theorem 2.2).


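Theorem 4 (Tsybakov (2009)). Let $\mathcal{F}$ be a class of models. Associated with each $f \in \mathcal{F}$ we have a probability measure $P_f$ defined on a common probability space. Let $d(\cdot, \cdot) : \mathcal{F} \times \mathcal{F} \to \mathbb{R}$ be a semi-distance. Let $f_0, f_1 \in \mathcal{F}$ be such that $d(f_0, f_1) \ge 2a$, with $a > 0$. Also assume that $KL(P_{f_1}, P_{f_0}) \le \gamma$, where $KL$ denotes the Kullback-Leibler divergence. Then, the following bound holds:

\[
\inf_{\hat f} \sup_{f \in \mathcal{F}} P_f\big(d(\hat f, f) \ge a\big)
\;\ge\; \inf_{\hat f} \max_{j \in \{0, 1\}} P_{f_j}\big(d(\hat f, f_j) \ge a\big)
\;\ge\; \max\left( \frac{1}{4} e^{-\gamma}, \; \frac{1 - \sqrt{\gamma / 2}}{2} \right) =: \rho
\]

where the inf is taken with respect to the collection of all possible estimators of $f$ based on a sample from $P_f$.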


Corollary 5. If $\gamma$ is a constant, then $\rho$ is a constant, and by Markov’s inequality, we would get

\[
\inf_{\hat f} \sup_{f \in \mathcal{F}} E\, d(\hat f, f) \ge \rho a
\]

and the minimax risk under loss $d$ would be $\succeq a$.
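The Markov step behind the corollary is short enough to spell out. For any estimator $\hat f$ and any $f$, using that $d$ is non-negative,

```latex
\[
E\, d(\hat f, f) \;\ge\; a \, P_f\big(d(\hat f, f) \ge a\big),
\qquad\text{so}\qquad
\inf_{\hat f} \sup_{f} E\, d(\hat f, f)
\;\ge\; a \, \inf_{\hat f} \sup_{f} P_f\big(d(\hat f, f) \ge a\big)
\;\ge\; \rho\, a .
\]
```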


Proof of Theorem 2, $k = 1$. Choose $\mathcal{F} = \mathcal{P}(1, \sigma_n)$. Let $P_t \in \mathcal{P}(1, \sigma_n)$ denote a regression function with threshold at $t$. We choose the semi-metric to be the distance between thresholds, i.e. $d(P_r, P_s) = |r - s|$. We now choose two such distributions with thresholds at least $2a_n$ apart (we use $a_n$ to explicitly remind the reader that $a$ will later be set to depend on $n$) - let them be denoted $P_{t_0}$ and $P_{t_1}$ with $t_0 = -a_n$, $t_1 = a_n$ and

\[
P_t(Y = + \mid X = x) =
\begin{cases}
0.5 - c & \text{if } x < t , \\
0.5 + c & \text{if } x \ge t .
\end{cases}
\]

Due to addition of noise, we get convolved distributions $P^0 := P_{t_0}(Y \mid W)$ and $P^1 := P_{t_1}(Y \mid W)$.

As hinted by the above corollary, we will choose $a_n$ so that $KL(P^1, P^0)$ is bounded by a constant, to get a lower bound on risk $\succeq a_n$. This follows by the following argument from Castro & Nowak.
























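The divergence $KL(P^1, P^0)$ can be bounded as

\begin{align}
KL(P^1, P^0)
&= E_{W,Y}\left[ \log \frac{P^1(W_1^n, Y_1^n)}{P^0(W_1^n, Y_1^n)} \right] \tag{2} \\
&= E_{W,Y}\left[ \log \frac{\prod_i P^1(Y_i \mid W_i)\, P(W_i \mid W_1^{i-1}, Y_1^{i-1})}{\prod_i P^0(Y_i \mid W_i)\, P(W_i \mid W_1^{i-1}, Y_1^{i-1})} \right] \tag{3} \\
&= E_{W,Y}\left[ \log \frac{\prod_i P^1(Y_i \mid W_i)}{\prod_i P^0(Y_i \mid W_i)} \right] \notag \\
&= \sum_i E_W\left[ E_Y\left[ \log \frac{P^1(Y_i \mid W_i)}{P^0(Y_i \mid W_i)} \,\Big|\, W_i \right] \right] \tag{4} \\
&\le n \max_{w \in [-1, 1]} E_Y\left[ \log \frac{P^1(Y \mid W)}{P^0(Y \mid W)} \,\Big|\, W = w \right] \tag{5} \\
&\preceq n \max_{w \in [-1, 1]} \big( P^1(Y \mid w) - P^0(Y \mid w) \big)^2 \tag{6}
\end{align}

where (3) holds for active learning because the algorithm determines $W_i$ when given $\{W_1^{i-1}, Y_1^{i-1}\}$ and is independent of the model, and follows by the independence of future from past for passive learning. (4) holds by the law of iterated expectation. (5) is used for active learning but is not needed for passive learning. (6) follows by the approximation

\[
KL\big(\mathrm{Ber}(1/2 + p), \mathrm{Ber}(1/2 + q)\big) \preceq (p - q)^2
\]

for sufficiently small constants $p, q$.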
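The Bernoulli approximation holds up to a constant factor, which can be sanity-checked numerically. The sketch below is an illustration with hypothetical helper names; it compares the exact divergence with the standard chi-square upper bound $(p - q)^2 / (b(1 - b))$, where $b = 1/2 + q$, which is at most a constant multiple of $(p - q)^2$ when $q$ is bounded away from $\pm 1/2$.

```python
import math


def kl_bernoulli(a, b):
    """Kullback-Leibler divergence KL(Ber(a), Ber(b)) for a, b in (0, 1)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))


def chi_square_bound(p, q):
    """Chi-square upper bound on KL(Ber(1/2 + p), Ber(1/2 + q))."""
    b = 0.5 + q
    return (p - q) ** 2 / (b * (1 - b))


pairs = [(0.05, -0.05), (0.1, 0.02), (-0.2, 0.1), (0.01, 0.0)]
# Ratio of the exact divergence to (p - q)^2; it stays bounded for small p, q.
ratios = [kl_bernoulli(0.5 + p, 0.5 + q) / (p - q) ** 2 for p, q in pairs]
```

For these pairs the ratio stays between roughly 2 and 4.2, so the squared difference controls the divergence up to a constant, which is all that step (6) needs.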


[Figure 2 panels: top, $P(Y = + \mid X = x)$ against $x$ for the two regression functions; bottom, $P(Y = + \mid W = w)$ against $w$ for their convolved versions.]

Figure 2: Regression functions before (top) and after (bottom) convolution with noise.

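$F_t(w) := P_t(Y \mid W = w) = \int P_t(Y \mid X)\, P(X \mid W = w)\, dX$ and a straightforward calculation reveals that

\[
F_t(w) =
\begin{cases}
0.5 - c & w < t - \sigma_n , \\[2pt]
0.5 + \dfrac{c}{\sigma_n}(w - t) & w \in [t - \sigma_n, t + \sigma_n] , \\[6pt]
0.5 + c & w > t + \sigma_n .
\end{cases} \tag{7}
\]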
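Equation (7) can be checked against a direct numerical average of the step function over the noise window. The sketch below uses hypothetical helper names and arbitrary illustrative values ($t = 0$, $c = 0.4$, $\sigma_n = 0.2$); the numerical version is a midpoint rule, so agreement is only up to the grid resolution.

```python
def step(x, t, c):
    """Unconvolved regression function for k = 1."""
    return 0.5 + c if x >= t else 0.5 - c


def convolved_step(w, t, c, sigma):
    """Closed form of the convolved regression function, as in equation (7)."""
    if w < t - sigma:
        return 0.5 - c
    if w > t + sigma:
        return 0.5 + c
    return 0.5 + (c / sigma) * (w - t)


def convolved_numeric(w, t, c, sigma, grid=10000):
    """Average of step(x) over x in [w - sigma, w + sigma], by the midpoint rule."""
    cell = 2.0 * sigma / grid
    return sum(step(w - sigma + (j + 0.5) * cell, t, c) for j in range(grid)) / grid


queries = (-0.5, -0.1, 0.0, 0.05, 0.3)
gaps = [abs(convolved_step(w, 0.0, 0.4, 0.2) - convolved_numeric(w, 0.0, 0.4, 0.2))
        for w in queries]
```

The jump of height $2c$ is replaced by a ramp of slope $c / \sigma_n$ over a window of width $2\sigma_n$ centered at $t$.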

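As depicted in Fig. 2, note the behavior before and after convolution with noise: (i) $m(t) = F(t) = 1/2$, hence $F_1(a_n) = 1/2 = F_0(-a_n)$; (ii) both convolved regression functions grow linearly for a region of width $2\sigma_n$, and differ only on a width of $2(\sigma_n + a_n)$; (iii) for a large region $[a_n - \sigma_n, -a_n + \sigma_n]$ of size $2(\sigma_n - a_n)$, we have $|F_1(w) - F_0(w)| = 2 a_n c / \sigma_n$, a constant. Their gap varies when $\sigma_n \ge a_n$ as

\[
|F_0(w) - F_1(w)| =
\begin{cases}
\dfrac{c}{\sigma_n}(w + a_n + \sigma_n) & w \in [-a_n - \sigma_n, \; a_n - \sigma_n] , \\[6pt]
\dfrac{2 a_n c}{\sigma_n} & w \in [a_n - \sigma_n, \; -a_n + \sigma_n] , \\[6pt]
\dfrac{c}{\sigma_n}(a_n + \sigma_n - w) & w \in [-a_n + \sigma_n, \; a_n + \sigma_n] , \\[6pt]
0 & \text{otherwise.}
\end{cases}
\]

When $\sigma_n \le a_n$,

\[
|F_1(w) - F_0(w)| =
\begin{cases}
\dfrac{c}{\sigma_n}(w + a_n + \sigma_n) & w \in [-a_n - \sigma_n, \; -a_n + \sigma_n] , \\[6pt]
2c & w \in [-a_n + \sigma_n, \; a_n - \sigma_n] , \\[6pt]
\dfrac{c}{\sigma_n}(a_n + \sigma_n - w) & w \in [a_n - \sigma_n, \; a_n + \sigma_n] , \\[6pt]
0 & \text{otherwise.}
\end{cases}
\]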


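For active learning, when $\sigma_n \succeq a_n$ we note

\[
\max_{w \in [-1, 1]} \big| P^1(Y \mid w) - P^0(Y \mid w) \big| = \frac{2 c\, a_n}{\sigma_n}
\]

and get $KL(P^1, P^0) \preceq n a_n^2 / \sigma_n^2$ by Eq. (6). We choose $a_n \asymp \sigma_n / \sqrt{n}$, which becomes our active minimax error rate by Corollary 5 when $\sigma_n \succeq a_n$, i.e. $\sigma_n \succeq e^{-n}$.

Similarly, if $\sigma_n \preceq \exp\{-n\}$, setting $a_n \asymp \exp\{-n\}$ easily gives us an exponentially small lower bound.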


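In the passive setting, Eq. (5) does not apply. Since the two convolved distributions differ only on an interval of size $2(\sigma_n + a_n)$, the effective number of points falling in this interval would be $\asymp n(a_n + \sigma_n)$.

When $\sigma_n \succeq a_n$, a simple calculation shows

\[
KL(P^1, P^0) \preceq n (a_n + \sigma_n) \frac{a_n^2}{\sigma_n^2} \asymp \frac{n a_n^2}{\sigma_n}
\]

giving rise to a choice of $a_n \asymp \sqrt{\sigma_n / n}$, which is the passive minimax rate when $\sigma_n \succeq a_n$, i.e. $\sigma_n \succeq 1/n$.

When $\sigma_n \preceq a_n$, a similar calculation shows

\[
KL(P^1, P^0) \le n (a_n + \sigma_n)\, 4c^2 \asymp n a_n
\]

giving rise to a choice of $a_n \asymp 1/n$, which is the passive minimax rate when $\sigma_n \preceq a_n$, i.e. $\sigma_n \preceq 1/n$.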


Proof of Theorem 2, $k > 1$. We follow a very similar setup to the case $k = 1$. The difference will lie in picking functions that are in $\mathcal{P}(c, C, k, \sigma_n)$ for general $k > 1$, and calculating the bounds on KL divergence appropriately. However, for notational convenience, we will assume that the domain is shifted to $[-\sigma_n, 2 - \sigma_n]$ instead of $[-1, 1]$ and that the distance between thresholds is $a_n$ instead of $2a_n$. Define

\[
P_0(Y \mid x) =
\begin{cases}
1/2 - c|x|^{k-1} & \text{if } x \in [-\sigma_n, 0] , \\
1/2 + c|x|^{k-1} & \text{if } x \ge 0 ,
\end{cases}
\]

\[
P_1(Y \mid x) =
\begin{cases}
1/2 - c|x - a_n|^{k-1} & \text{if } x \in [-\sigma_n, a_n] , \\
1/2 + c|x - a_n|^{k-1} & \text{if } x \in [a_n, \beta a_n + \sigma_n] , \\
1/2 + c|x|^{k-1} & \text{if } x \ge \beta a_n + \sigma_n ,
\end{cases}
\]

where $\beta > 1$ is a constant chosen such that $P_1 \in \mathcal{P}(c, C, k, \sigma_n)$ (this fact is verified explicitly in the Appendix). For ease of notation, $P_0, P_1$ are understood to actually saturate at $0, 1$ if need be (i.e. we are implicitly working with $\min\{P_{0/1}, 1\}$, etc). The two thresholds are clearly at $0, a_n$ respectively, and after the point $\beta a_n + \sigma_n$, the two functions are the same. Continuing the same notation as for $k = 1$, we let $P^i = P_i(Y \mid W) = F_i(w)$ for $i = 0, 1$.

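The following claims hold true (Appendix).

1. When $\sigma_n \preceq a_n$, $\max_w |F_0(w) - F_1(w)| \preceq a_n^{k-1}$.

2. When $\sigma_n \succeq a_n$, $\max_w |F_0(w) - F_1(w)| \asymp \sigma_n^{k-2} a_n$.

3. As a subpart of the above cases, when $\sigma_n \asymp a_n$, $\max_w |F_0(w) - F_1(w)| \asymp a_n^{k-1} \asymp \sigma_n^{k-1}$.

If the above propositions are true, we can verify:

1. In the first case, $KL(P^1, P^0) \preceq n a_n^{2k-2}$, hence $a_n \asymp n^{-\frac{1}{2k-2}}$ is a lower bound when $\sigma_n \preceq n^{-\frac{1}{2k-2}}$.

2. Otherwise, $KL(P^1, P^0) \preceq n \sigma_n^{2k-4} a_n^2$, hence $a_n \asymp \frac{1}{\sqrt{n}} \sigma_n^{-(k-2)}$ is a lower bound when $\sigma_n \succeq n^{-\frac{1}{2k-2}}$.

The passive bounds follow by not just considering the maximum difference $|F_0(w) - F_1(w)|$ but also the length of the region where the two functions differ, since it is directly proportional to the number of points that may randomly fall in that region. Following the same calculations,

1. When $\sigma_n \preceq a_n$, $|F_0(w) - F_1(w)| \preceq a_n^{k-1}$ for all $w \in [0, \beta a_n + 2\sigma_n]$. Hence $KL(P^1, P^0) \preceq n(\beta a_n + 2\sigma_n) a_n^{2k-2} \asymp n a_n^{2k-1}$ and $a_n \asymp n^{-\frac{1}{2k-1}}$ is the minimax passive rate when $\sigma_n \preceq n^{-\frac{1}{2k-1}}$.

2. When $\sigma_n \succeq a_n$, $|F_0(w) - F_1(w)| \asymp \sigma_n^{k-2} a_n$ for all $w \in [0, \beta a_n + 2\sigma_n]$. Hence $KL(P^1, P^0) \preceq n(\beta a_n + 2\sigma_n) \sigma_n^{2k-4} a_n^2 \asymp n \sigma_n^{2k-3} a_n^2$ and $a_n \asymp \frac{1}{\sqrt{n}} \sigma_n^{-(k - 3/2)}$ is the minimax passive rate when $\sigma_n \succeq n^{-\frac{1}{2k-1}}$,

as verified from the Appendix calculation. ■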


4 Upper Bounds 

For passive sampling, we present a modified histogram estimator, WIDEHIST, when the noise level $\sigma_n$ is larger than the noiseless minimax rate of $1/n$. Assume for simplicity that the $n$ sampled points on $[-1, 1]$ are equally spaced to mimic a uniform distribution, indexed by $j = 1, \ldots, n$.
Algorithm WIDEHIST.

1. Divide $[-1, 1]$ into $m$ bins of width $h > 2/n$, so $m = 2/h < n$. The $i$-th bin covers $[-1 + (i - 1)h, -1 + ih]$, $i \in \{1, \ldots, m\}$, and hence each bin has $n/m$ points. Let $b_i$ be the average number of positive labels in bin $i$ of these $n/m$ points.

2. Let $\hat p_i$ be the average of the $b_i$’s over all bins within $\pm \sigma_n / 2$ of bin $i$. We “classify” regions with $\hat p_i < 1/2$ as being $-$ and $\hat p_i > 1/2$ as being $+$, and return $\hat t$ as the center of the first bin from left to right where $\hat p_i$ crosses half.

Observe that we need not operate on $[-1, 1]$ with $n$ queries - WIDEHIST(D,B) could take as inputs any domain $D$ and any query budget $B$. The argument below hinges on the fact that the convolved regression function behaves linearly around $t$.
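A possible reading of WIDEHIST(D,B) in code is sketched below. It is an illustration under several assumptions, not the authors' implementation: the function and parameter names are hypothetical, the bin width is set to roughly $\sqrt{\sigma \log B / B}$ (a choice made here, with constants ignored), counts are pooled over the window of bins rather than averaging per-bin means (the same thing when bins hold equally many points), and the caller is expected to pass a domain that already respects assumption (Q).

```python
import math
import random


def widehist(lo, hi, budget, sigma, oracle):
    """Sketch of WIDEHIST on domain [lo, hi] with `budget` equally spaced queries."""
    # Bin width: an assumed choice of order sqrt(sigma * log(B) / B), at least one query per bin.
    h = max(math.sqrt(sigma * math.log(budget) / budget), (hi - lo) / budget)
    n_bins = max(1, math.ceil((hi - lo) / h))
    h = (hi - lo) / n_bins
    positives = [0] * n_bins
    counts = [0] * n_bins
    for j in range(budget):
        w = lo + (j + 0.5) * (hi - lo) / budget       # equally spaced query points
        i = min(int((w - lo) / h), n_bins - 1)
        counts[i] += 1
        positives[i] += oracle(w) == 1
    radius = int(sigma / (2.0 * h))                   # bins within +/- sigma/2 of bin i
    for i in range(n_bins):
        a, b = max(0, i - radius), min(n_bins, i + radius + 1)
        p_hat = sum(positives[a:b]) / max(1, sum(counts[a:b]))
        if p_hat >= 0.5:                              # first bin where p_hat crosses half
            return lo + (i + 0.5) * h
    return hi


# Usage with a simulated Berkson oracle (k = 1, threshold 0.1, c = 0.4, sigma = 0.2).
rng = random.Random(0)
t, c, sigma = 0.1, 0.4, 0.2


def oracle(w):
    x = w + rng.uniform(-sigma, sigma)
    return 1 if rng.random() < (0.5 + c if x >= t else 0.5 - c) else -1


t_hat = widehist(-1 + sigma, 1 - sigma, 20000, sigma, oracle)
```

Averaging over a window of width about $\sigma_n$ is what distinguishes this from an ordinary histogram: the convolved function is a ramp of that width, so the wide average still crosses half near $t$ while using many more labels per estimate.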


Proof of Theorem 3, $k = 1$, (Passive). Let $i^* \in \{1, \ldots, m\}$ denote the true bin $[-1 + (i^* - 1)h, -1 + i^* h]$ that contains $t$. Let $\hat t$ be from bin $\hat i$, i.e. $\hat p_{\hat i - 1} < 1/2$ and $\hat p_{\hat i} > 1/2$. We will argue that $\hat i$ is very close to $i^*$, in which case the point error we suffer is $|\hat i - i^*| h$. Specifically, we prove that all bins except $I^* = \{i^* - 1, i^*, i^* + 1\}$ will be “classified” correctly with high probability. In other words, we claim w.h.p. $\hat p_i < 1/2$ if $i < i^* - 1$ and $\hat p_i > 1/2$ if $i > i^* + 1$.

Indeed, writing $p_i = E[\hat p_i]$, we can show (Appendix)

\[
\text{For } i \ge i^* + 2, \quad E[\hat p_i] \ge E[\hat p_{i^* + 2}] \ge 1/2 + \frac{c}{\sigma_n} h \tag{8}
\]

\[
\text{For } i \le i^* - 2, \quad E[\hat p_i] \le E[\hat p_{i^* - 2}] \le 1/2 - \frac{c}{\sigma_n} h \tag{9}
\]

Using Hoeffding’s inequality, we get that for bin $i$, $\Pr(|\hat p_i - p_i| > \epsilon) \le 2 \exp\{-2 \tfrac{n \sigma_n}{2} \epsilon^2\}$. Taking a union bound over all bins other than those in $I^*$ and setting $\epsilon = \frac{c}{\sigma_n} h$, we get

\[
\Pr\Big( \exists\, i \notin I^* : |\hat p_i - p_i| > \frac{c}{\sigma_n} h \Big) \le 2m \exp\Big\{ -2 \, \frac{n \sigma_n}{2} \, \frac{c^2 h^2}{\sigma_n^2} \Big\} .
\]

So we get bins $i \notin I^*$ correct and $\hat i \in \{i^* - 1, i^*, i^* + 1\}$ with probability

\[
\ge 1 - 2n \exp\Big\{ -\frac{n c^2 h^2}{\sigma_n} \Big\}
\]

since $m < n$. Setting $h = \frac{1}{c} \sqrt{\frac{\sigma_n}{n} \log\frac{2n}{\delta}}$ makes this hold with probability $\ge 1 - \delta$, so the point error $|\hat i - i^*| h < 2h$ behaves like $h \asymp \sqrt{\sigma_n / n}$, up to logarithmic factors. ■

For active sampling when the noise level $\sigma_n$ is larger than the minimax noiseless rate $e^{-n}$, we present an algorithm ACTPASS which makes its $n$ queries on the domain $[-1, 1]$ in $E$ different epochs/rounds. As a subroutine, it uses any optimal passive learning algorithm, like WIDEHIST(D,B). In each round, ACTPASS runs WIDEHIST on progressively smaller domains $D$ with a restricted budget $B$. Hence it “activizes” WIDEHIST and achieves the optimal active rate in the process. This algorithm was inspired by a similar idea from Ramdas & Singh (2013).

Algorithm ACTPASS.

Let $E = \lceil \log(1/\sigma_n) \rceil$ be the number of epochs and $D_1 = [-1, 1]$ denote the domain of “radius” $R_1 = 1$ around $t_0 = 0$. The budget of every epoch is a constant $B = n/E$. For epochs $1 \le e \le E$, do:

1. Query for $B$ labels uniformly on $D_e$.

2. Let $t_e = \mathrm{WIDEHIST}(D_e, B)$ be the returned estimator using the most recent samples and labels.

3. Define $D_{e+1} = [t_e - 2^{-e}, t_e + 2^{-e}] \cap [-1, 1]$ with a radius of at most $R_{e+1} = 2^{-e}$ around $t_e$. Repeat.

Observe that ACTPASS runs while $R_e > \sigma_n$, since by design $E \ge \log(1/\sigma_n)$ so $\sigma_n \ge 2^{-E} = R_{E+1}$.

Proof of Theorem 3, $k = 1$, (Active). The analysis of ACTPASS proceeds in two stages depending on the value of $\sigma_n$. Initially, when $R_e$ is large, it is possible that $\sigma_n \preceq R_e / n$ and in this phase, the passive algorithm WIDEHIST will behave as if it is in the noiseless setting since the noise is smaller than its noiseless rate. However, after some point, when $R_e$ becomes quite small, $\sigma_n \succeq R_e / n$ is possible and then WIDEHIST will behave as if it is in the noisy setting since noise is larger than its noiseless rate. Observe that it cannot stay in the first phase till the end of the algorithm, since the first phase runs while $\sigma_n \preceq R_e / n$ but we know that $\sigma_n \ge R_{E+1}$ by construction, so there must be an epoch where it switches phases, and ends the algorithm in its second phase.

We prove (by a separate induction in each epoch) that with high probability, the true threshold $t$ will always lie inside the domain at the start of every epoch (this is clearly true before the first epoch). We claim:

1. Before all $e$ in phase one, $t \in D_e$ w.h.p.

2. Before all $e$ in phase two, $t \in D_e$ w.h.p.

We prove these in the Appendix. If these are true, then in the second phase, WIDEHIST is in the large noise setting and it gets an error of $\sqrt{R_e \sigma_n / (n/E)}$. Hence the final error of the algorithm is

\[
\sqrt{\frac{R_E \, \sigma_n}{n / E}} \asymp \sigma_n \sqrt{\frac{E}{n}} ,
\]

which is $\sigma_n / \sqrt{n}$ up to the logarithmic factor $E$, since $R_E \asymp \sigma_n$. ■


Proof of Theorem 3, $k > 1$. The proofs for $k > 1$ are simply generalizations of those for $k = 1$. Again, we present concise arguments here for the settings where the algorithm can actually detect noise, i.e. when the noise level is larger than the noiseless minimax rate (otherwise, one can argue that algorithms which worked for the noiseless case will suffice). In both cases, the algorithm remains unchanged.

1. We outline the proof for WIDEHIST when $\sigma_n \succeq n^{-\frac{1}{2k-1}}$. Using similar notation as before, we will again show that if $t$ is in bin $i^*$ of width $h < \sigma_n$, then except for bins $i^* - 1, i^*, i^* + 1$, we will “classify” all other bins correctly with high probability, by averaging over the $n\sigma_n / 2$ points to the left and right of that bin. Specifically, we claim

\[
\text{For } i \ge i^* + 2, \quad E[\hat p_i] \ge E[\hat p_{i^* + 2}] \ge 1/2 + \lambda \sigma_n^{k-2} h \tag{10}
\]

\[
\text{For } i \le i^* - 2, \quad E[\hat p_i] \le E[\hat p_{i^* - 2}] \le 1/2 - \lambda \sigma_n^{k-2} h \tag{11}
\]

A similar use of Hoeffding’s inequality gives

\[
\Pr\Big( \exists\, i \notin I^* : |\hat p_i - p_i| > \lambda \sigma_n^{k-2} h \Big) \le 2m \exp\Big\{ -2 \Big( \frac{n \sigma_n}{2} \Big) h^2 \lambda^2 \sigma_n^{2k-4} \Big\} .
\]

Arguing as before, w.h.p. we get a point error of $h \asymp \frac{1}{\sqrt{n}} \sigma_n^{-(k - 3/2)}$ when $\sigma_n \succeq n^{-\frac{1}{2k-1}}$.

2. We outline the proof for ACTPASS when $\sigma_n \succeq n^{-\frac{1}{2k-2}}$. As before, the algorithm runs in two phases, and we will prove required properties within each phase by induction.

The first phase is when $R_e$ is large and so $\sigma_n$ may possibly be smaller than $(R_e / n)^{\frac{1}{2k-1}}$ and WIDEHIST will achieve noiseless rates within each epoch. In the second phase, after $R_e$ has shrunk enough, $\sigma_n$ will become larger than $(R_e / n)^{\frac{1}{2k-1}}$ and WIDEHIST will achieve noisy rates in these epochs.

One can verify, as before, that the second phase must occur, by design. Intuitively, the second phase must occur because we make a fixed number of queries $n/E \asymp n / \log n$ in a halving domain size (equivalently we make geometrically increasing queries on a rescaled domain), and so relatively in successive epochs this noiseless error shrinks, and at some point $\sigma_n$ becomes larger than this shrinking noiseless error rate.

As before we make the following claims:

1. Before all $e$ in phase one, $t \in D_e$ w.h.p.

2. Before all $e$ in phase two, $t \in D_e$ w.h.p.

These are proved in the Appendix by induction.

The final point error is given by WIDEHIST in the last epoch as

\[
\sqrt{\frac{R_E}{n / E}} \; \sigma_n^{-(k - 3/2)} \asymp \sqrt{\frac{E}{n}} \; \sigma_n^{-(k - 2)}
\]

since $R_E \asymp \sigma_n$ and $E \asymp \log n$; this is the active rate of Theorem 1 up to logarithmic factors.

5 Conclusion 


In this paper, we propose a simple Berkson error model for one-dimensional threshold classification, inspired by the setup and model analysed in Castro & Nowak (2007, 2008), in which we can analyse active learning with additive uniform feature noise. To the best of our knowledge, this is the first attempt at jointly tackling feature noise and label noise in active learning.


This simple setting already yields interesting behaviour depending on the additive feature noise level and the label noise of the underlying regression function. For both passive and active learning, whenever the noise level is smaller than the minimax noiseless rate, the learner cannot notice that there is noise, and will continue to achieve the noiseless rate. As the noise gets larger, the rates do depend on the noise level. Importantly, one can achieve better rates than passive learning in most scenarios, and we propose unique algorithms/estimators to achieve tight rates. The idea of “activizing” passive algorithms, like algorithm ACTPASS did, seems especially powerful and could carry forward to other settings beyond our paper and Ramdas & Singh (2013).


The immediate future work and most direct extension to this paper concerns the main weakness of the paper - the possibility of getting rid of Assumption (M), which is the only hurdle to a fair comparison with the noiseless setting. We would like to re-emphasize that at first glance, the rates may be misleading and counterintuitive because it “appears” as if larger noise could possibly help estimation due to the presence of $\sigma_n$ in the denominator for larger $k$.


However, we point out once more that the class of functions is not constant over all $\sigma_n$: it
depends on $\sigma_n$, and in fact it gets “smaller” in some sense with larger $\sigma_n$ because the assumption
(M) becomes more stringent. This observation about the non-constant function class, along with
the fact that convolution with uniform noise seems to unflatten the regression function as shown
in the figures, together cause the rates to seemingly improve with larger noise levels.

Analysing the case without (M) seems to be quite a challenging task since the noiseless and
convolved thresholds can be different. We did attempt to formulate a few kernel-based estimators
with additional assumptions, but do not presently have tight bounds, and leave those for future
work.

A Justifying Claims in the Lower Bounds


Approximations:

1. $(x + y)^k = x^k (1 + y/x)^k \approx x^k + k x^{k-1} y$ when $y \prec x$. Even when $y \preceq x$, both terms are the
same order.

2. $(x - y)^k = x^k (1 - y/x)^k \approx x^k - k x^{k-1} y$ when $y \prec x$. Even when $y \preceq x$, both terms are the
same order.

3. When $y < x$ but not $y \prec x$, by Taylor expansion of $(1 + z)^k$ around $z = 0$, we have
$(x + y)^k = x^k (1 + y/x)^k = x^k [1 + k(1 + c)^{k-1} y/x] = x^k + C x^{k-1} y$ for some $0 < c < y/x < 1$
and some constant $C$. Similarly for $(x - y)^k$.
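As a quick numerical sanity check of these approximations, the following sketch compares $(x + y)^k$ with its first-order expansion. The function names are illustrative only and do not come from the paper.

```python
def first_order(x: float, y: float, k: float) -> float:
    """First-order expansion x^k + k x^(k-1) y of (x + y)^k."""
    return x**k + k * x ** (k - 1) * y


def relative_error(x: float, y: float, k: float) -> float:
    """Relative gap between (x + y)^k and its first-order expansion."""
    exact = (x + y) ** k
    return abs(exact - first_order(x, y, k)) / exact


# y much smaller than x: the expansion is accurate.
small = relative_error(1.0, 0.01, 3)
# y comparable to x: exact and expanded values differ only by a constant factor.
comparable = relative_error(1.0, 1.0, 3)
```

In the second case the exact value is 8 and the expansion gives 4, so the two agree up to a constant, which is all the order-of-magnitude arguments below require.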

Let’s assume the boundary is at $-\sigma$ for easier calculations (we denote $a_n, \sigma_n$ as $a, \sigma$ here).
Remember

$$m_1(x) = 1/2 + c\,x|x|^{k-2} \quad \text{if } x \ge -\sigma$$

$$m_2(x) = \begin{cases} 1/2 + c\,(x - a)|x - a|^{k-2} & \text{if } x \le \beta a + \sigma \\ m_1(x) & \text{if } x > \beta a + \sigma \end{cases}$$

where $\beta = 1/\left(1 - (c/C)^{1/(k-1)}\right) > 1$ is such that $m_2 \in P(k, c, C, \sigma)$. Clearly, when $x \le \beta a + \sigma$, $m_2$
satisfies condition (T). So, we only need to verify that whenever $x > \beta a + \sigma$ we have

$$m_2(x) - 1/2 = c x^{k-1} \le C (x - a)^{k-1} \qquad (12)$$

This statement holds iff $(c/C)^{1/(k-1)} \le 1 - a/x \iff a/x \le 1 - (c/C)^{1/(k-1)} \iff x \ge \beta a$, which
holds for all $\sigma > 0$, and hence $m_2$ satisfies condition (T).
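The equivalence behind (12) can be checked numerically. The sketch below assumes $\beta = 1/(1 - (c/C)^{1/(k-1)})$, the value implied by the chain of equivalences above; the helper names are hypothetical.

```python
def beta(c: float, C: float, k: float) -> float:
    """Smallest multiple of a beyond which c x^(k-1) <= C (x - a)^(k-1)."""
    return 1.0 / (1.0 - (c / C) ** (1.0 / (k - 1)))


def satisfies_T(x: float, a: float, c: float, C: float, k: float) -> bool:
    """Check inequality (12) at a point x > a."""
    return c * x ** (k - 1) <= C * (x - a) ** (k - 1)


# Illustrative constants (not from the paper): c = 1, C = 4, k = 3 give beta = 2.
b = beta(1.0, 4.0, 3.0)
above = satisfies_T(2.5, 1.0, 1.0, 4.0, 3.0)  # x > beta * a
below = satisfies_T(1.5, 1.0, 1.0, 4.0, 3.0)  # a < x < beta * a
```

With these constants the inequality holds at $x = 2.5$ and fails at $x = 1.5$, matching the threshold $x \ge \beta a = 2$.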

Proposition 1. When $\sigma \prec a$, $\max_w |F_1(w) - F_2(w)| \asymp a^{k-1}$.
Proposition 2. When $\sigma \succ a$, $\max_w |F_1(w) - F_2(w)| \asymp \sigma^{k-2} a$.

Let us now prove these two propositions, with detailed calculations in each case (note that when
$\sigma \asymp a$, then $\max_w |F_1(w) - F_2(w)| \asymp a^{k-1} \asymp \sigma^{k-2} a$, which can be checked using our approximations
1, 2, 3).
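The two propositions can be explored numerically before reading the case analysis. The sketch below is an illustration under stated assumptions, not the paper's procedure: it takes $m_1(x) = 1/2 + c\,x|x|^{k-2}$, takes $m_2$ to be the shifted copy up to $\beta a + \sigma$ and $m_1$ beyond, uses uniform noise on $[-\sigma, \sigma]$, and approximates the convolution by a midpoint average. All function names and constants are hypothetical.

```python
import numpy as np


def m1(x, c, k):
    """Regression function with threshold at 0."""
    return 0.5 + c * np.sign(x) * np.abs(x) ** (k - 1)


def m2(x, c, k, a, sigma, beta):
    """Copy of m1 shifted to threshold a, switching back to m1 past beta*a + sigma."""
    shifted = 0.5 + c * np.sign(x - a) * np.abs(x - a) ** (k - 1)
    return np.where(x <= beta * a + sigma, shifted, m1(x, c, k))


def convolve_uniform(m, w, sigma, num=2001):
    """Average of m over [w - sigma, w + sigma] by the midpoint rule."""
    offsets = (np.arange(num) + 0.5) / num * 2 * sigma - sigma
    return m(w[:, None] + offsets[None, :]).mean(axis=1)


def max_gap(c, C, k, a, sigma, grid=400):
    """Numerical max over w >= 0 of |F1(w) - F2(w)|."""
    beta = 1.0 / (1.0 - (c / C) ** (1.0 / (k - 1)))
    w = np.linspace(0.0, beta * a + 2 * sigma, grid)
    F1 = convolve_uniform(lambda x: m1(x, c, k), w, sigma)
    F2 = convolve_uniform(lambda x: m2(x, c, k, a, sigma, beta), w, sigma)
    return float(np.max(np.abs(F1 - F2)))


# For k = 2 both propositions predict a gap of order a (here exactly c * a).
gap_small_noise = max_gap(0.1, 0.2, 2, a=0.05, sigma=0.01)
gap_large_noise = max_gap(0.1, 0.2, 2, a=0.01, sigma=0.05)
```

For $k = 2$ the two regression functions differ by exactly $c\,a$ wherever $m_2$ is the shifted copy, so the gap is $c\,a$ in both regimes; other values of $k$ can be tried to see the $a^{k-1}$ versus $\sigma^{k-2} a$ scaling.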

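1. When $\sigma \prec a$, we will prove Proposition 1. Remember that we can’t query in $-\sigma < w < 0$.

(a) When $0 < w < \sigma$, we have

$$F_1(w) = (m_1 * U)(w) = \int_{w-\sigma}^{0} \left(1/2 - c|x|^{k-1}\right) \frac{dx}{2\sigma} + \int_{0}^{w+\sigma} \left(1/2 + c x^{k-1}\right) \frac{dx}{2\sigma}$$

$$= 1/2 + \frac{c \sigma^{k}}{2\sigma k} \left[ (1 + w/\sigma)^k - (1 - w/\sigma)^k \right] \approx 1/2 + c \sigma^{k-2} w$$

$$F_2(w) = (m_2 * U)(w) = \int_{w-\sigma}^{w+\sigma} \left(1/2 - c|x - a|^{k-1}\right) \frac{dx}{2\sigma}$$

$$= 1/2 - \frac{c}{2\sigma k} \left[ (a + \sigma - w)^k - (a - w - \sigma)^k \right] \approx 1/2 - c (a - w)^{k-1}$$

[Boundaries: $F_1(0) - 1/2 = 0$, $F_1(\sigma) - 1/2 \asymp \sigma^{k-1}$, $F_2(0) - 1/2 \asymp -a^{k-1}$, $F_2(\sigma) - 1/2 \asymp -a^{k-1}$]

$$F_1(w) - F_2(w) \preceq a^{k-1}$$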

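(b) When $\sigma < w < a - \sigma$

$$F_1(w) = (m_1 * U)(w) = \int_{w-\sigma}^{w+\sigma} \left(1/2 + c x^{k-1}\right) \frac{dx}{2\sigma} = 1/2 + \frac{c}{2\sigma k} \left[ (w + \sigma)^k - (w - \sigma)^k \right] \approx 1/2 + c w^{k-1}$$

$$F_2(w) = (m_2 * U)(w) = \int_{w-\sigma}^{w+\sigma} \left(1/2 - c|x - a|^{k-1}\right) \frac{dx}{2\sigma}$$

$$= 1/2 - \frac{c}{2\sigma k} \left[ (a + \sigma - w)^k - (a - w - \sigma)^k \right] \approx 1/2 - c (a - w)^{k-1}$$

[Boundaries: $F_1(\sigma) - 1/2 \asymp \sigma^{k-1}$, $F_1(a - \sigma) - 1/2 \asymp a^{k-1}$, $F_2(\sigma) - 1/2 \asymp -a^{k-1}$, $F_2(a - \sigma) - 1/2 \asymp -\sigma^{k-1}$]

$$F_1(w) - F_2(w) \approx c w^{k-1} + c (a - w)^{k-1} \le c (a - \sigma)^{k-1} + c (a - \sigma)^{k-1} \preceq a^{k-1}$$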


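(c) When $a - \sigma < w < a$

$$F_1(w) \approx 1/2 + c w^{k-1}$$

$$F_2(w) = \int_{w-\sigma}^{a} \left(1/2 - c|x - a|^{k-1}\right) \frac{dx}{2\sigma} + \int_{a}^{w+\sigma} \left(1/2 + c (x - a)^{k-1}\right) \frac{dx}{2\sigma}$$

$$= 1/2 + \frac{c}{2\sigma k} \left[ (w + \sigma - a)^k - (a - w + \sigma)^k \right] \approx 1/2 - c \sigma^{k-2} (a - w)$$

[Boundaries: $F_1(a - \sigma) - 1/2 \asymp a^{k-1}$, $F_1(a) - 1/2 \asymp a^{k-1}$, $F_2(a - \sigma) - 1/2 \asymp -\sigma^{k-1}$, $F_2(a) - 1/2 = 0$]

$$F_1(w) - F_2(w) \approx c w^{k-1} + c \sigma^{k-2} (a - w) \le c a^{k-1} + c \sigma^{k-2} \sigma \preceq a^{k-1}$$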


(d) When $a < w < a + \sigma$

$$F_1(w) \approx 1/2 + c w^{k-1}$$

$$F_2(w) \approx 1/2 + c \sigma^{k-2} (w - a)$$

[Boundaries: $F_1(a) - 1/2 \asymp a^{k-1}$, $F_1(a + \sigma) - 1/2 \asymp a^{k-1}$, $F_2(a) - 1/2 = 0$, $F_2(a + \sigma) - 1/2 \asymp \sigma^{k-1}$]

$$F_1(w) - F_2(w) \preceq a^{k-1}$$
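(e) When $a + \sigma < w < \beta a - \sigma$

$$F_1(w) \approx 1/2 + c w^{k-1}$$

$$F_2(w) = \int_{w-\sigma}^{w+\sigma} \left(1/2 + c (x - a)^{k-1}\right) \frac{dx}{2\sigma} = 1/2 + \frac{c}{2\sigma k} \left[ (w + \sigma - a)^k - (w - \sigma - a)^k \right] \approx 1/2 + c (w - a)^{k-1}$$

[Boundaries: $F_1(a + \sigma) - 1/2 \asymp a^{k-1}$, $F_1(\beta a - \sigma) - 1/2 \asymp a^{k-1}$, $F_2(a + \sigma) - 1/2 \asymp \sigma^{k-1}$, $F_2(\beta a - \sigma) - 1/2 \asymp a^{k-1}$]

$$F_1(w) - F_2(w) \approx c w^{k-1} - c (w - a)^{k-1} \le c (\beta a - \sigma)^{k-1} + c a^{k-1} \le c (\beta^{k-1} + 1) a^{k-1} \asymp a^{k-1}$$

(f) When $\beta a - \sigma < w < \beta a + \sigma$

$$F_1(w) \approx 1/2 + c w^{k-1}$$

$$F_2(w) = \int_{w-\sigma}^{\beta a} \left(1/2 + c (x - a)^{k-1}\right) \frac{dx}{2\sigma} + \int_{\beta a}^{w+\sigma} \left(1/2 + c x^{k-1}\right) \frac{dx}{2\sigma}$$

$$= 1/2 + \frac{c}{2\sigma k} \left[ (\beta a - a)^k - (w - \sigma - a)^k + (w + \sigma)^k - (\beta a)^k \right]$$

[Boundaries: $F_1(\beta a - \sigma) - 1/2 \asymp a^{k-1}$, $F_1(\beta a + \sigma) - 1/2 \asymp a^{k-1}$, $F_2(\beta a - \sigma) - 1/2 \asymp a^{k-1}$, $F_2(\beta a + \sigma) - 1/2 \asymp a^{k-1}$]

$$F_1(w) - F_2(w) \approx c w^{k-1} - \frac{c}{2\sigma k} \left[ (\beta a - a)^k - (w - \sigma - a)^k + (w + \sigma)^k - (\beta a)^k \right]$$

$$\le c (\beta + 1)^{k-1} a^{k-1} + \frac{c}{2\sigma k} \left[ (\beta a)^k - (\beta a - 2\sigma)^k \right] - \frac{c}{2\sigma k} \left[ (\beta - 1)^k a^k - ((\beta - 1) a - \sigma)^k \right]$$

$$\approx c (\beta + 1)^{k-1} a^{k-1} + \frac{c}{2\sigma k} \left[ k (\beta a)^{k-1} 2\sigma \right] - \frac{c}{2\sigma k} \left[ k (\beta - 1)^{k-1} a^{k-1} \sigma \right]$$

$$= c a^{k-1} \left[ (\beta + 1)^{k-1} + \beta^{k-1} - \tfrac{1}{2} (\beta - 1)^{k-1} \right] \asymp a^{k-1}$$

(g) When $\beta a + \sigma < w < \beta a + 2\sigma$

$$F_1(w) = 1/2 + \frac{c}{2\sigma k} \left[ (w + \sigma)^k - (w - \sigma)^k \right]$$

$$F_2(w) = \int_{w-\sigma}^{\beta a + \sigma} \left(1/2 + c (x - a)^{k-1}\right) \frac{dx}{2\sigma} + \int_{\beta a + \sigma}^{w+\sigma} \left(1/2 + c x^{k-1}\right) \frac{dx}{2\sigma}$$

$$= 1/2 + \frac{c}{2\sigma k} \left[ (\beta a + \sigma - a)^k - (w - \sigma - a)^k + (w + \sigma)^k - (\beta a + \sigma)^k \right]$$

[Boundaries: $F_1(\beta a + \sigma) - 1/2 \asymp a^{k-1}$, $F_1(\beta a + 2\sigma) - 1/2 \asymp a^{k-1}$, $F_2(\beta a + \sigma) - 1/2 \asymp a^{k-1}$, $F_2(\beta a + 2\sigma) - 1/2 \asymp a^{k-1}$]

$$F_1(w) - F_2(w) = \frac{c}{2\sigma k} \left[ (\beta a + \sigma)^k - (\beta a + \sigma - a)^k + (w - \sigma - a)^k - (w - \sigma)^k \right]$$

$$\approx \frac{c}{2\sigma k} \left[ (\beta a + \sigma)^{k-1} k a - (w - \sigma)^{k-1} k a \right]$$

$$\le \frac{c a}{2\sigma} \left[ (\beta a + \sigma)^{k-1} - (\beta a)^{k-1} \right]$$

$$\approx \frac{c a}{2\sigma} \left[ (\beta a)^{k-1} \left( 1 + \frac{(k-1)\sigma}{\beta a} \right) - (\beta a)^{k-1} \right]$$

$$= a^{k-1} \left[ c \beta^{k-2} (k - 1)/2 \right]$$

(h) When $w > \beta a + 2\sigma$

$$F_1(w) = F_2(w)$$

That completes the proof of the first claim.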

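2. When $\sigma \succ a$, we will prove the second proposition.

(a) When $-\sigma < w < 0$, we are not allowed to query here.

(b) When $0 < w < \beta a$

$$F_1(w) = (m_1 * U)(w) = \int_{w-\sigma}^{0} \left(1/2 - c|x|^{k-1}\right) \frac{dx}{2\sigma} + \int_{0}^{w+\sigma} \left(1/2 + c x^{k-1}\right) \frac{dx}{2\sigma}$$

$$= 1/2 + \frac{c}{2\sigma k} \left[ (w + \sigma)^k - (\sigma - w)^k \right]$$

$$= 1/2 + \frac{c}{2\sigma k} \sigma^k \left[ (1 + w/\sigma)^k - (1 - w/\sigma)^k \right]$$

Similarly $F_2(w) \approx 1/2 + c \sigma^{k-2} (w - a)$.

[Boundaries: $F_1(0) - 1/2 = 0$, $F_1(\beta a) - 1/2 \asymp \sigma^{k-2} a$, $F_2(0) - 1/2 \asymp -\sigma^{k-2} a$, $F_2(\beta a) - 1/2 \asymp \sigma^{k-2} a$]

$$F_1(w) - F_2(w) \asymp \sigma^{k-2} a$$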


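(c) When $\beta a < w < \sigma$

$$F_1(w) = \int_{w-\sigma}^{0} \left(1/2 - c|x|^{k-1}\right) \frac{dx}{2\sigma} + \int_{0}^{w+\sigma} \left(1/2 + c x^{k-1}\right) \frac{dx}{2\sigma}$$

$$= 1/2 + \frac{c}{2\sigma k} \left[ (w + \sigma)^k - (\sigma - w)^k \right]$$

$$= 1/2 + \frac{c}{2\sigma k} \sigma^k \left[ (1 + w/\sigma)^k - (1 - w/\sigma)^k \right] \approx 1/2 + c \sigma^{k-2} w$$

$$F_2(w) = \int_{w-\sigma}^{a} \left(1/2 - c|x - a|^{k-1}\right) \frac{dx}{2\sigma} + \int_{a}^{\beta a + \sigma} \left(1/2 + c (x - a)^{k-1}\right) \frac{dx}{2\sigma} + \int_{\beta a + \sigma}^{w+\sigma} \left(1/2 + c x^{k-1}\right) \frac{dx}{2\sigma}$$

$$= 1/2 + \frac{c}{2\sigma k} \left[ -(\sigma + a - w)^k + (\beta a + \sigma - a)^k + (w + \sigma)^k - (\beta a + \sigma)^k \right]$$

$$\approx 1/2 + \frac{c}{2\sigma k} \left[ -\sigma^k \left(1 - \frac{k(w - a)}{\sigma}\right) + \sigma^k \left(1 + \frac{k(\beta - 1)a}{\sigma}\right) + \sigma^k \left(1 + \frac{k w}{\sigma}\right) - \sigma^k \left(1 + \frac{k \beta a}{\sigma}\right) \right]$$

$$= 1/2 + \frac{c}{2} \sigma^{k-2} \left[ w - a + (\beta - 1) a + w - \beta a \right]$$

$$= 1/2 + c \sigma^{k-2} (w - a)$$

[Boundaries: $F_1(\beta a) - 1/2 \asymp \sigma^{k-2} a$, $F_1(\sigma) - 1/2 \asymp \sigma^{k-1}$, $F_2(\beta a) - 1/2 \asymp \sigma^{k-2} a$, $F_2(\sigma) - 1/2 \asymp \sigma^{k-1} - \sigma^{k-2} a$]

$$F_1(w) - F_2(w) \asymp \sigma^{k-2} a$$

Specifically, verify the boundary at $\sigma$:

$$F_1(\sigma) - F_2(\sigma) = \frac{c}{2\sigma k} \left[ a^k - (\beta a + \sigma - a)^k + (\beta a + \sigma)^k \right]$$

$$\approx \frac{c}{2\sigma k} \left[ a^k - \sigma^k \left(1 + \frac{k(\beta - 1)a}{\sigma}\right) + \sigma^k \left(1 + \frac{k \beta a}{\sigma}\right) \right]$$

$$= \frac{c}{2\sigma k} \left[ a^k + k \sigma^{k-1} a \right]$$

$$\le c \sigma^{k-2} a$$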


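(d) When $\sigma < w < a + \sigma$

$$F_1(w) = \int_{w-\sigma}^{w+\sigma} \left(1/2 + c x^{k-1}\right) \frac{dx}{2\sigma} = 1/2 + \frac{c}{2\sigma k} \left[ (w + \sigma)^k - (w - \sigma)^k \right]$$

$$F_2(w) = \int_{w-\sigma}^{a} \left(1/2 - c|x - a|^{k-1}\right) \frac{dx}{2\sigma} + \int_{a}^{\beta a + \sigma} \left(1/2 + c (x - a)^{k-1}\right) \frac{dx}{2\sigma} + \int_{\beta a + \sigma}^{w+\sigma} \left(1/2 + c x^{k-1}\right) \frac{dx}{2\sigma}$$

$$= 1/2 + \frac{c}{2\sigma k} \left[ -(\sigma + a - w)^k + (\beta a + \sigma - a)^k + (w + \sigma)^k - (\beta a + \sigma)^k \right]$$

$$F_1(w) - F_2(w) = \frac{c}{2\sigma k} \left[ (\sigma + a - w)^k - (\beta a + \sigma - a)^k - (w - \sigma)^k + (\beta a + \sigma)^k \right]$$

Differentiating the above term with respect to $w$ gives $\frac{c}{2\sigma} \left[ -(\sigma + a - w)^{k-1} - (w - \sigma)^{k-1} \right] \le 0$
because $\sigma < w < a + \sigma$, and hence $F_1(w) - F_2(w)$ is decreasing with $w$. We
already saw $F_1(\sigma) - F_2(\sigma) \le c \sigma^{k-2} a$. We can also verify that at the other boundary,

$$F_1(a + \sigma) - F_2(a + \sigma) = \frac{c}{2\sigma k} \left[ -(\beta a + \sigma - a)^k - a^k + (\beta a + \sigma)^k \right] \le \frac{c}{2} \sigma^{k-2} a$$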
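(e) When $a + \sigma < w < \beta a + \sigma$

$$F_1(w) = \int_{w-\sigma}^{w+\sigma} \left(1/2 + c x^{k-1}\right) \frac{dx}{2\sigma} = 1/2 + \frac{c}{2\sigma k} \left[ (w + \sigma)^k - (w - \sigma)^k \right]$$

$$F_2(w) = \int_{w-\sigma}^{\beta a + \sigma} \left(1/2 + c (x - a)^{k-1}\right) \frac{dx}{2\sigma} + \int_{\beta a + \sigma}^{w+\sigma} \left(1/2 + c x^{k-1}\right) \frac{dx}{2\sigma}$$

$$= 1/2 + \frac{c}{2\sigma k} \left[ (\beta a + \sigma - a)^k - (w - \sigma - a)^k + (w + \sigma)^k - (\beta a + \sigma)^k \right]$$

$$F_1(w) - F_2(w) = \frac{c}{2\sigma k} \left[ (w - \sigma - a)^k - (\beta a + \sigma - a)^k - (w - \sigma)^k + (\beta a + \sigma)^k \right]$$

Differentiating with respect to $w$ gives $\frac{c}{2\sigma} \left[ (w - \sigma - a)^{k-1} - (w - \sigma)^{k-1} \right] \le 0$ because
$w - \sigma - a < w - \sigma$, and so $F_1 - F_2$ is decreasing with $w$. We know $F_1(a + \sigma) - F_2(a + \sigma) \le \frac{c}{2} \sigma^{k-2} a$,
and we can verify at the other boundary that

$$F_1(\beta a + \sigma) - F_2(\beta a + \sigma) = \frac{c}{2\sigma k} \left[ (\beta a - a)^k - (\beta a + \sigma - a)^k - (\beta a)^k + (\beta a + \sigma)^k \right]$$

$$\approx \frac{c}{2\sigma k} \left[ (\beta a - a)^k - (\beta a)^k - \sigma^k \left(1 + \frac{k(\beta - 1)a}{\sigma}\right) + \sigma^k \left(1 + \frac{k \beta a}{\sigma}\right) \right]$$

$$= \frac{c}{2\sigma k} \left[ (\beta a - a)^k - (\beta a)^k + k \sigma^{k-1} a \right] \le \frac{c}{2} \sigma^{k-2} a$$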

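(f) When $\beta a + \sigma < w < \beta a + 2\sigma$

$$F_1(w) = 1/2 + \frac{c}{2\sigma k} \left[ (w + \sigma)^k - (w - \sigma)^k \right]$$

$$F_2(w) = \int_{w-\sigma}^{\beta a + \sigma} \left(1/2 + c (x - a)^{k-1}\right) \frac{dx}{2\sigma} + \int_{\beta a + \sigma}^{w+\sigma} \left(1/2 + c x^{k-1}\right) \frac{dx}{2\sigma}$$

$$= 1/2 + \frac{c}{2\sigma k} \left[ (\beta a + \sigma - a)^k - (w - \sigma - a)^k + (w + \sigma)^k - (\beta a + \sigma)^k \right]$$

Hence

$$F_1(w) - F_2(w) = \frac{c}{2\sigma k} \left[ (\beta a + \sigma)^k - (\beta a + \sigma - a)^k + (w - \sigma - a)^k - (w - \sigma)^k \right]$$

$$\approx \frac{c}{2\sigma k} \left[ (\beta a + \sigma)^{k-1} k a - (w - \sigma)^{k-1} k a \right]$$

$$\le \frac{c a}{2\sigma} \left[ (\beta a + \sigma)^{k-1} - (\beta a)^{k-1} \right] \preceq \sigma^{k-2} a$$

Alternately, by the same argument as in the previous case, differentiating with respect
to $w$ gives $\frac{c}{2\sigma} \left[ (w - \sigma - a)^{k-1} - (w - \sigma)^{k-1} \right] \le 0$ because $w - \sigma - a < w - \sigma$, and so
$F_1 - F_2$ is decreasing with $w$. We know $F_1(\beta a + \sigma) - F_2(\beta a + \sigma) \preceq \sigma^{k-2} a$, and we
can verify at the other endpoint that

$$F_1(\beta a + 2\sigma) - F_2(\beta a + 2\sigma) = 0$$

(g) When $w > \beta a + 2\sigma$, $F_1(w) = F_2(w)$.

That completes the proof of the second proposition.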


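B Convolved Regression Function, Justifying Eqs. (8-11)


For ease of presentation, let us assume the threshold is at 0, and define $m \in P(c, C, k, \sigma)$ as

$$m(x) = \begin{cases} 1/2 + f(x) + \Delta(x) & \text{if } x \ge 0 \\ 1/2 - f(x) & \text{if } x < 0 \end{cases}$$

Due to assumption (M), $\Delta(x)$ must be 0 when $0 \le x \le \sigma$. Hence, the Taylor expansion of $\Delta(x)$
around $x = \sigma$ looks like

$$\Delta(x) = (x - \sigma) \Delta'(\sigma) + (x - \sigma)^2 \Delta''(\sigma) + \dots$$

If one represents, as before, $F(x) = m * U$, then directly from the definitions, it follows for $\delta > 0$
that

$$F(\delta) - F(0) = \int_{\sigma}^{\sigma + \delta} \left(1/2 + f(z) + \Delta(z)\right) \frac{dz}{2\sigma} - \int_{-\sigma}^{-\sigma + \delta} \left(1/2 - f(z)\right) \frac{dz}{2\sigma}$$

In particular, due to the form (T) of $m$, let $f = c_1 |x|^{k-1}$ for some $c \le c_1 \le C$ (we could also
break $f$ into parts where it has different $c_1$'s but this is a technicality and does not change the
behaviour). Then

$$F(\delta) - F(0) = \frac{c_1}{2k\sigma} \left[ z^k \right]_{\sigma}^{\sigma + \delta} + \frac{c_1}{2k\sigma} \left[ -|z|^k \right]_{-\sigma}^{-\sigma + \delta} + \int_{\sigma}^{\delta + \sigma} \left[ (z - \sigma) \Delta'(\sigma) + (z - \sigma)^2 \Delta''(\sigma) + \dots \right] \frac{dz}{2\sigma}$$

$$= \frac{c_1}{2k\sigma} \left[ (\sigma + \delta)^k - \sigma^k + \sigma^k - (\sigma - \delta)^k \right] + \frac{\Delta'(\sigma)}{2\sigma} \left[ \frac{(z - \sigma)^2}{2} \right]_{\sigma}^{\sigma + \delta} + \dots$$

$$= c_1 \sigma^{k-2} \delta + \frac{\delta^2}{4\sigma} \Delta'(\sigma) + o(\delta^2)$$

Thus we get behaviour of the form

$$F(t + h) \ge 1/2 + c \sigma^{k-2} h$$

One can derive similar results when $\delta < 0$.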
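The leading term $c_1 \sigma^{k-2} \delta$ can be checked numerically against the exact increment. The sketch below assumes the correction term vanishes identically ($\Delta \equiv 0$) and $0 < \delta < \sigma$, so that $F(\delta) - F(0) = \frac{c_1}{2k\sigma}[(\sigma + \delta)^k - (\sigma - \delta)^k]$; the function names and constants are illustrative.

```python
def exact_increment(delta: float, sigma: float, c1: float, k: float) -> float:
    """F(delta) - F(0) for f(x) = c1 |x|^(k-1), assuming Delta is identically 0."""
    return c1 / (2 * k * sigma) * ((sigma + delta) ** k - (sigma - delta) ** k)


def linear_increment(delta: float, sigma: float, c1: float, k: float) -> float:
    """Leading-order term c1 * sigma^(k-2) * delta."""
    return c1 * sigma ** (k - 2) * delta


# Illustrative values: k = 3, sigma = 0.5, delta = 0.01, c1 = 1.
exact = exact_increment(0.01, 0.5, 1.0, 3)
approx = linear_increment(0.01, 0.5, 1.0, 3)
rel_gap = abs(exact - approx) / approx
```

The relative gap is of order $(\delta/\sigma)^2$, so the convolved function is effectively linear with slope $c_1 \sigma^{k-2}$ near the threshold.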

The claims about WIDEHIST immediately follow from the above, but we can make them a little
more explicit. First note that $F(w) = 1/2 + \frac{c}{\sigma}(w - t)$ for $w$ close to $t$ (in fact for $w \in [t - \sigma, t + \sigma]$),
as seen in Section 1 of this Appendix. Consider a bin just outside the bins $i^* - 1, i^*, i^* + 1$, for
instance bin $i = i^* + 2$ centered at $b_i$ (note $b_i \ge t + h$), and let $J$ be the set of points $j$ that fall
within $b_i \pm \sigma/2$. Define

$$\hat{p}_i = \frac{1}{n\sigma/2R} \sum_{j \in J} I(Y_j = +)$$

where $Y_j \in \{\pm 1\}$ are observations at points $j \in J$. Now, we have, since $P(Y_j = +) = F(j)$,

$$E[\hat{p}_i] = \frac{1}{n\sigma/2R} \sum_{j \in J} F(x_j) \ge \frac{1}{n\sigma/2R} \sum_{j \in J} \left( 1/2 + \frac{c}{\sigma}(x_j - t) \right)$$

$$\approx 1/2 + \frac{1}{\sigma} \int_{b_i - t - \sigma/2}^{b_i - t + \sigma/2} \frac{c}{\sigma} u \, du$$

$$= 1/2 + \frac{c}{2\sigma^2} \left[ (b_i - t + \sigma/2)^2 - (b_i - t - \sigma/2)^2 \right]$$

$$= 1/2 + \frac{c}{\sigma}(b_i - t)$$
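The last two steps say that averaging a linear function over a bin returns its value at the bin centre. A minimal numerical check, assuming the linear form $1/2 + \frac{c}{\sigma}(x - t)$ and evenly spaced query points in the bin (the function name and constants are hypothetical):

```python
import numpy as np


def bin_mean_probability(b: float, t: float, sigma: float, c: float,
                         num_points: int = 1001) -> float:
    """Mean of 1/2 + (c / sigma) * (x - t) over evenly spaced x in [b - sigma/2, b + sigma/2]."""
    xs = np.linspace(b - sigma / 2, b + sigma / 2, num_points)
    return float(np.mean(0.5 + (c / sigma) * (xs - t)))


# Bin centred at b = 0.2 with threshold t = 0.1, noise width sigma = 0.2, slope constant c = 0.1.
mean_prob = bin_mean_probability(0.2, 0.1, 0.2, 0.1)
expected = 0.5 + (0.1 / 0.2) * (0.2 - 0.1)
```

Here the bin lies inside $[t - \sigma, t + \sigma]$, where the linear form is stated to hold, and the bin mean matches $1/2 + \frac{c}{\sigma}(b_i - t)$.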


C Justifying Claims in the Active Upper Bounds 


Phase 1 ($k = 1$). In the first phase of the algorithm, it is possible that $\sigma \prec R_e/n$ while still
satisfying the lower bound on $\sigma$ used below. In other words, the noise may be small enough that passive
learning cannot make out that we are in the errors-in-variables setting, and then the passive estimator
will get a point error of $C_1 R_e E/n$ in each of those epochs (as if there is no feature noise). This point
error is to the best point in epoch $e$, which we can prove by induction is the true threshold $t$ with high
probability. Since it trivially holds in the first epoch ($t \in D_1 = [-1, 1]$), we assume that it is true in
epoch $e - 1$. Then, in epoch $e$, the true threshold $t$ is still the best point if the estimator $\hat{x}_{e-1}$ of
epoch $e - 1$ was within $R_e$ of $t$, or in other words if $|\hat{x}_{e-1} - t| \le R_e$. This would definitely hold if
$C_1 R_{e-1} E/n \le R_e$, i.e. $n \ge 2 C_1 E = 2 C_1 \lceil \log(1/\sigma) \rceil$, which is true since $\sigma \succ \exp\{-n/2C_1\}$.
However, the algorithm cannot stay in this phase of $\sigma \prec R_e/n$ until the last epoch since
$\sigma > R_{E+1} = R_E/2$.

Phase 2 ($k = 1$). When $\sigma \succ R_e/n$, WIDEHIST gets an estimation error of $C_2 \sqrt{\sigma R_e E/n}$ in epoch
$e$. This error is the distance to the best point in epoch $e$, which is $t$ by the following similar
induction. In epoch $e$, $t$ is still the best point only if $|\hat{x}_{e-1} - t| \le R_e$, i.e. $C_2 \sqrt{\sigma R_{e-1} E/n} \le R_e$, i.e.
$n R_e \ge 2 C_2^2 E \sigma$, which holds since $R_e \ge \sigma$ for all $e \le E$ and since $n \ge 2 C_2^2 E$ ($\sigma \succ \exp\{-n/2C_2^2\}$
implies $E \le n/2C_2^2$).


The final error of the algorithm is $C_2 \sqrt{\sigma R_E E/n} = \tilde{O}(\sigma/\sqrt{n})$, since $R_E \le 2\sigma$.


Explanation for $k > 1$. Assume $\sigma \succ n^{-1/(2k-2)}$, otherwise active learning won’t notice the feature
noise, and so $\log(1/\sigma) \le \frac{1}{2k-2} \log n$. Choose total epochs $E = \lceil \log(1/\sigma) \rceil \le C \log n$ for some
$C$. In each epoch of length $n/E$ in a region of radius $R_e = 2^{-e+1}$, we get a passive bound of
$C_1 \sqrt{R_e E/n}\, \sigma^{-(k-3/2)}$ whenever $\sigma \succ (R_e/n)^{1/(2k-1)}$. (This must happen at some $e \le E = \lceil \log(1/\sigma) \rceil$ because
$R_E = 2^{-E+1} \le 2\sigma \prec \sigma \sigma^{2k-2} n$ since $\sigma \succ n^{-1/(2k-2)}$, and hence in the last epoch $\sigma \succ (R_E/n)^{1/(2k-1)}$.)

By the same logic as for $k = 1$, we need to verify that $|\hat{x}_{e-1} - t| \le R_e$ so that if $t$ was in the
search space in epoch $e - 1$ then it remains in the search space in epoch $e$, i.e. we want to
verify $C_1^2 \frac{R_{e-1} E}{n} \sigma^{-(2k-3)} \le R_e^2$, which is true since $R_e \ge \sigma$ and $\sigma^{2k-2} \ge 2 C_1^2 E/n$.

(By choice of $E = \lceil \log(1/\sigma) \rceil$, $R_e \ge R_E \ge \sigma \ge R_{E+1}$. Since $\sigma \succ n^{-1/(2k-2)}$ we get $\sigma^{2k-2} \ge 2 C_1^2 E/n$
since $E \le C \log n$.)
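The epoch schedule used in this argument can be written out directly. The sketch below assumes the logarithm is base 2, which is what makes $R_e = 2^{-e+1}$ bracket $\sigma$ as claimed; the function name is hypothetical.

```python
import math


def epoch_schedule(sigma: float):
    """Return E = ceil(log2(1/sigma)) and the radii R_1, ..., R_{E+1} with R_e = 2^(-e+1)."""
    E = math.ceil(math.log2(1.0 / sigma))
    radii = [2.0 ** (-e + 1) for e in range(1, E + 2)]
    return E, radii


# Example with sigma = 0.1: the search radius halves each epoch until it brackets sigma.
sigma = 0.1
E, radii = epoch_schedule(sigma)
R_E, R_next = radii[E - 1], radii[E]
```

For $\sigma = 0.1$ this gives $E = 4$ with $R_E = 0.125$ and $R_{E+1} = 0.0625$, so $R_E \ge \sigma \ge R_{E+1}$ and $R_E \le 2\sigma$, the two facts the induction and the final rate rely on.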


The final point error is given by the passive algorithm in the last epoch as
$C_1 \sqrt{R_E E/n}\, \sigma^{-(k-3/2)}$; since $R_E \le 2\sigma$ and $E \le C \log n$, this becomes $\tilde{O}(n^{-1/2} \sigma^{-(k-2)})$.